SML HW4 : Formula 1 World Championship Analysis Using Unsupervised Learning¶
Team Members: Alekhiya, Devi and Hrishabh¶
Question 1: Can we group F1 drivers by their race-day performance profiles across seasons using data from results, qualifying, lap times and pit stops?¶
%pip install yellowbrick
%pip install plotly
%pip install tabulate
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.metrics import silhouette_score
from collections import Counter
import plotly.graph_objects as go
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import colors as mcolors
from tabulate import tabulate
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from fancyimpute import SoftImpute
Load Data¶
dataframes = {
    name: pd.read_csv(f"{name}.csv") for name in [
        "circuits", "constructor_results", "constructor_standings", "constructors",
        "driver_standings", "drivers", "lap_times", "pit_stops",
        "qualifying", "races", "results", "seasons",
        "sprint_results", "status"
    ]
}
# Visualize shape and sample of each dataset
for name, df in dataframes.items():
    print(f"\n==== {name.upper()} ====")
    print(f"Shape: {df.shape}")
    print(df.head())
==== CIRCUITS ====
Shape: (77, 9)
circuitId circuitRef name location \
0 1 albert_park Albert Park Grand Prix Circuit Melbourne
1 2 sepang Sepang International Circuit Kuala Lumpur
2 3 bahrain Bahrain International Circuit Sakhir
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló
4 5 istanbul Istanbul Park Istanbul
country lat lng alt \
0 Australia -37.84970 144.96800 10
1 Malaysia 2.76083 101.73800 18
2 Bahrain 26.03250 50.51060 7
3 Spain 41.57000 2.26111 109
4 Turkey 40.95170 29.40500 130
url
0 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 http://en.wikipedia.org/wiki/Istanbul_Park
==== CONSTRUCTOR_RESULTS ====
Shape: (12625, 5)
constructorResultsId raceId constructorId points status
0 1 18 1 14.0 \N
1 2 18 2 8.0 \N
2 3 18 3 9.0 \N
3 4 18 4 5.0 \N
4 5 18 5 2.0 \N
==== CONSTRUCTOR_STANDINGS ====
Shape: (13391, 7)
constructorStandingsId raceId constructorId points position \
0 1 18 1 14.0 1
1 2 18 2 8.0 3
2 3 18 3 9.0 2
3 4 18 4 5.0 4
4 5 18 5 2.0 5
positionText wins
0 1 1
1 3 0
2 2 0
3 4 0
4 5 0
==== CONSTRUCTORS ====
Shape: (212, 5)
constructorId constructorRef name nationality \
0 1 mclaren McLaren British
1 2 bmw_sauber BMW Sauber German
2 3 williams Williams British
3 4 renault Renault French
4 5 toro_rosso Toro Rosso Italian
url
0 http://en.wikipedia.org/wiki/McLaren
1 http://en.wikipedia.org/wiki/BMW_Sauber
2 http://en.wikipedia.org/wiki/Williams_Grand_Pr...
3 http://en.wikipedia.org/wiki/Renault_in_Formul...
4 http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso
==== DRIVER_STANDINGS ====
Shape: (34863, 7)
driverStandingsId raceId driverId points position positionText wins
0 1 18 1 10.0 1 1 1
1 2 18 2 8.0 2 2 0
2 3 18 3 6.0 3 3 0
3 4 18 4 5.0 4 4 0
4 5 18 5 4.0 5 5 0
==== DRIVERS ====
Shape: (861, 9)
driverId driverRef number code forename surname dob \
0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07
1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10
2 3 rosberg 6 ROS Nico Rosberg 1985-06-27
3 4 alonso 14 ALO Fernando Alonso 1981-07-29
4 5 kovalainen \N KOV Heikki Kovalainen 1981-10-19
nationality url
0 British http://en.wikipedia.org/wiki/Lewis_Hamilton
1 German http://en.wikipedia.org/wiki/Nick_Heidfeld
2 German http://en.wikipedia.org/wiki/Nico_Rosberg
3 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso
4 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen
==== LAP_TIMES ====
Shape: (589081, 6)
raceId driverId lap position time milliseconds
0 841 20 1 1 1:38.109 98109
1 841 20 2 1 1:33.006 93006
2 841 20 3 1 1:32.713 92713
3 841 20 4 1 1:32.803 92803
4 841 20 5 1 1:32.342 92342
==== PIT_STOPS ====
Shape: (11371, 7)
raceId driverId stop lap time duration milliseconds
0 841 153 1 1 17:05:23 26.898 26898
1 841 30 1 1 17:05:52 25.021 25021
2 841 17 1 11 17:20:48 23.426 23426
3 841 4 1 12 17:22:34 23.251 23251
4 841 13 1 13 17:24:10 23.842 23842
==== QUALIFYING ====
Shape: (10494, 9)
qualifyId raceId driverId constructorId number position q1 \
0 1 18 1 1 22 1 1:26.572
1 2 18 9 2 4 2 1:26.103
2 3 18 5 1 23 3 1:25.664
3 4 18 13 6 2 4 1:25.994
4 5 18 2 2 3 5 1:25.960
q2 q3
0 1:25.187 1:26.714
1 1:25.315 1:26.869
2 1:25.452 1:27.079
3 1:25.691 1:27.178
4 1:25.518 1:27.236
==== RACES ====
Shape: (1125, 18)
raceId year round circuitId name date \
0 1 2009 1 1 Australian Grand Prix 2009-03-29
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05
2 3 2009 3 17 Chinese Grand Prix 2009-04-19
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26
4 5 2009 5 4 Spanish Grand Prix 2009-05-10
time url fp1_date \
0 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N
1 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N
2 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N
3 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N
4 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N
fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time \
0 \N \N \N \N \N \N \N
1 \N \N \N \N \N \N \N
2 \N \N \N \N \N \N \N
3 \N \N \N \N \N \N \N
4 \N \N \N \N \N \N \N
sprint_date sprint_time
0 \N \N
1 \N \N
2 \N \N
3 \N \N
4 \N \N
==== RESULTS ====
Shape: (26759, 18)
resultId raceId driverId constructorId number grid position \
0 1 18 1 1 22 1 1
1 2 18 2 2 3 5 2
2 3 18 3 3 7 7 3
3 4 18 4 4 5 11 4
4 5 18 5 1 23 3 5
positionText positionOrder points laps time milliseconds \
0 1 1 10.0 58 1:34:50.616 5690616
1 2 2 8.0 58 +5.478 5696094
2 3 3 6.0 58 +8.163 5698779
3 4 4 5.0 58 +17.181 5707797
4 5 5 4.0 58 +18.014 5708630
fastestLap rank fastestLapTime fastestLapSpeed statusId
0 39 2 1:27.452 218.300 1
1 41 3 1:27.739 217.586 1
2 41 5 1:28.090 216.719 1
3 58 7 1:28.603 215.464 1
4 43 1 1:27.418 218.385 1
==== SEASONS ====
Shape: (75, 2)
year url
0 2009 http://en.wikipedia.org/wiki/2009_Formula_One_...
1 2008 http://en.wikipedia.org/wiki/2008_Formula_One_...
2 2007 http://en.wikipedia.org/wiki/2007_Formula_One_...
3 2006 http://en.wikipedia.org/wiki/2006_Formula_One_...
4 2005 http://en.wikipedia.org/wiki/2005_Formula_One_...
==== SPRINT_RESULTS ====
Shape: (360, 16)
resultId raceId driverId constructorId number grid position \
0 1 1061 830 9 33 2 1
1 2 1061 1 131 44 1 2
2 3 1061 822 131 77 3 3
3 4 1061 844 6 16 4 4
4 5 1061 846 1 4 6 5
positionText positionOrder points laps time milliseconds \
0 1 1 3 17 25:38.426 1538426
1 2 2 2 17 +1.430 1539856
2 3 3 1 17 +7.502 1545928
3 4 4 0 17 +11.278 1549704
4 5 5 0 17 +24.111 1562537
fastestLap fastestLapTime statusId
0 14 1:30.013 1
1 17 1:29.937 1
2 17 1:29.958 1
3 16 1:30.163 1
4 16 1:30.566 1
==== STATUS ====
Shape: (139, 2)
statusId status
0 1 Finished
1 2 Disqualified
2 3 Accident
3 4 Collision
4 5 Engine
To characterize "race-day performance," we merge the following tables:
results.csv
qualifying.csv
pit_stops.csv
lap_times.csv (aggregated to an average lap time per race)
pd.set_option('display.max_columns', None)
results = (
    dataframes["results"][[
        "raceId", "driverId", "constructorId", "grid", "positionOrder", "points",
        "laps", "time", "milliseconds", "fastestLap", "rank", "fastestLapTime",
        "fastestLapSpeed", "statusId"
    ]]
    .rename(columns={"rank": "fastesLapRank", "milliseconds": "Total_time_ms",
                     "positionOrder": "finalPosition"})
    .merge(dataframes["races"][["raceId", "year", "round", "circuitId", "name", "date"]],
           on="raceId", how="left")
)
#Merging with status table on statusId
results = results.merge(dataframes["status"], on="statusId", how="left")
# Merge with qualifying data
merged = results.merge(
    dataframes["qualifying"][["raceId", "driverId", "position", "q1", "q2", "q3"]]
        .rename(columns={"position": "qualifyingPosition"}),
    on=["raceId", "driverId"], how="left"
)
merged = merged.merge(dataframes["drivers"][["driverId", "driverRef"]], on="driverId", how="left")
#Merge with pit_stops data
#df_pit = dataframes["pit_stops"].rename(columns={"stop": "pit_stop_number", "lap": "pit_stop_lap", "time":"pit_stop_time", "duration": "pit_stop_duration","milliseconds": "pit_stop_duration_ms"})
#merged = merged.merge(df_pit, on=["raceId", "driverId"], how="left")
pit_summary = dataframes["pit_stops"].copy()
pit_summary["milliseconds"] = pd.to_numeric(pit_summary["milliseconds"], errors="coerce")
pit_summary = pit_summary.groupby(["raceId", "driverId"]).agg(
    total_pit_stops=('stop', 'count'),
    average_pit_duration_ms=('milliseconds', 'mean')
).reset_index()
merged = merged.merge(pit_summary, on=["raceId", "driverId"], how="left")
#Merge with lap_times data
avg_lap_time = dataframes["lap_times"].groupby(["raceId", "driverId"])["milliseconds"].mean().reset_index(name="avg_lap_time")
avg_lap_time["avg_lap_time_str"] = avg_lap_time["avg_lap_time"].apply(
    lambda ms: f"{int(ms // 60000)}:{(ms % 60000) / 1000:.3f}" if pd.notna(ms) else None
)
merged = merged.merge(avg_lap_time.rename(columns = {"avg_lap_time": "avg_lap_time_ms"}), on=["raceId", "driverId"], how="left")
merged_matrix_comp = merged.copy()
merged.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26759 entries, 0 to 26758
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   raceId                   26759 non-null  int64
 1   driverId                 26759 non-null  int64
 2   constructorId            26759 non-null  int64
 3   grid                     26759 non-null  int64
 4   finalPosition            26759 non-null  int64
 5   points                   26759 non-null  float64
 6   laps                     26759 non-null  int64
 7   time                     26759 non-null  object
 8   Total_time_ms            26759 non-null  object
 9   fastestLap               26759 non-null  object
 10  fastesLapRank            26759 non-null  object
 11  fastestLapTime           26759 non-null  object
 12  fastestLapSpeed          26759 non-null  object
 13  statusId                 26759 non-null  int64
 14  year                     26759 non-null  int64
 15  round                    26759 non-null  int64
 16  circuitId                26759 non-null  int64
 17  name                     26759 non-null  object
 18  date                     26759 non-null  object
 19  status                   26759 non-null  object
 20  qualifyingPosition       10494 non-null  float64
 21  q1                       10494 non-null  object
 22  q2                       10472 non-null  object
 23  q3                       10448 non-null  object
 24  driverRef                26759 non-null  object
 25  total_pit_stops          5575 non-null   float64
 26  average_pit_duration_ms  5575 non-null   float64
 27  avg_lap_time_ms          11041 non-null  float64
 28  avg_lap_time_str         11041 non-null  object
dtypes: float64(5), int64(10), object(14)
memory usage: 5.9+ MB
merged.isnull().sum()
raceId                         0
driverId                       0
constructorId                  0
grid                           0
finalPosition                  0
points                         0
laps                           0
time                           0
Total_time_ms                  0
fastestLap                     0
fastesLapRank                  0
fastestLapTime                 0
fastestLapSpeed                0
statusId                       0
year                           0
round                          0
circuitId                      0
name                           0
date                           0
status                         0
qualifyingPosition         16265
q1                         16265
q2                         16287
q3                         16311
driverRef                      0
total_pit_stops            21184
average_pit_duration_ms    21184
avg_lap_time_ms            15718
avg_lap_time_str           15718
dtype: int64
Data Cleaning and Transformation¶
The columns q1, q2, q3, total_pit_stops, average_pit_duration_ms, avg_lap_time_ms, and avg_lap_time_str each contain roughly 50% or more missing data. These columns are nonetheless essential for clustering drivers by performance, and imputing such a large share of missing values could introduce significant bias or noise into the analysis.
# Drop rows missing any of the key performance columns
merged = merged.dropna(subset=['qualifyingPosition', 'q1', 'q2', 'q3', 'total_pit_stops', 'average_pit_duration_ms', 'avg_lap_time_ms', 'avg_lap_time_str']).reset_index(drop=True)
# Some columns contain the literal sentinel \N; replace it with NaN and drop those rows
merged.replace(to_replace='\\N', value=np.nan, inplace=True)
merged.dropna(inplace=True)
# Create a position_change column: starting grid position minus final position
merged["position_change"] = merged["grid"] - merged["finalPosition"].astype(int)
# Verify whether any missing values remain
merged.isnull().sum().sum()
0
# Check which columns are still of type object; time-like ones will be converted to milliseconds
merged.select_dtypes(include="object").columns
Index(['time', 'Total_time_ms', 'fastestLap', 'fastesLapRank',
'fastestLapTime', 'fastestLapSpeed', 'name', 'date', 'status', 'q1',
'q2', 'q3', 'driverRef', 'avg_lap_time_str'],
dtype='object')
# Convert a time string of the form 'm:ss.xxx' to milliseconds; anything else becomes NaN
def time_str_to_ms(t):
    if pd.isna(t):
        return np.nan
    t = str(t).strip()
    if re.match(r"^\d+:\d+\.\d+$", t):  # Matches '1:29.844' style
        minutes, rest = t.split(":")
        seconds = float(rest)
        return int(minutes) * 60000 + seconds * 1000
    return np.nan  # e.g. gap strings like '+5.478' are treated as missing
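As a standalone sanity check (the helper is redefined inline so the snippet runs on its own), the conversion handles both real lap times and the gap-style strings that appear in the `time` column:

```python
import re
import numpy as np
import pandas as pd

def time_str_to_ms(t):
    # Same logic as the notebook helper: 'm:ss.xxx' -> milliseconds, anything else -> NaN
    if pd.isna(t):
        return np.nan
    t = str(t).strip()
    if re.match(r"^\d+:\d+\.\d+$", t):
        minutes, rest = t.split(":")
        return int(minutes) * 60000 + float(rest) * 1000
    return np.nan

print(round(time_str_to_ms("1:29.844"), 3))  # 89844.0
print(time_str_to_ms("+5.478"))              # nan (a gap to the leader, not an absolute time)
```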
# Apply to the object-dtype time columns, converting them to milliseconds (nullable Int64)
merged["fastestLapTime_ms"] = merged["fastestLapTime"].apply(time_str_to_ms).astype("Int64")
merged["q1_ms"] = merged["q1"].apply(time_str_to_ms).astype("Int64")
merged["q2_ms"] = merged["q2"].apply(time_str_to_ms).astype("Int64")
merged["q3_ms"] = merged["q3"].apply(time_str_to_ms).astype("Int64")
merged["Total_time_ms"] = pd.to_numeric(merged["Total_time_ms"], errors="coerce").astype("Int64")
merged["fastestLap"] = pd.to_numeric(merged["fastestLap"], errors="coerce").astype("Int64")
merged["fastesLapRank"] = pd.to_numeric(merged["fastesLapRank"], errors="coerce").astype("Int64")
merged["fastestLapSpeed"] = pd.to_numeric(merged["fastestLapSpeed"], errors="coerce").astype("float64")
#Encoding the qualitative variable status
le = LabelEncoder()
merged["status_encoded"] = le.fit_transform(merged["status"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
{'Disqualified': 0, 'Finished': 1, 'Power Unit': 2}
# Dropping the columns which don't contain numeric data
final_data = merged.drop(columns = ['time', 'fastestLapTime', 'q1', 'q2', 'q3', 'driverRef', 'avg_lap_time_str', 'name', 'status', 'date'])
Columns like raceId, driverId, and constructorId are stored numerically, but they are categorical identifiers in numeric form: their values carry no mathematical meaning (e.g., driverId = 10 is not "twice" driverId = 5). We therefore remove those columns and perform SVD on the remaining features.
# Identifier columns (raceId, driverId, constructorId, etc.) carry no magnitude information, so they are excluded before standardization
scaler = StandardScaler()
X_features = final_data.drop(columns=["raceId", "driverId", "constructorId", "circuitId", "statusId", "round", "status_encoded", "year", "laps"])
X_scaled = scaler.fit_transform(X_features)
data_scaled_df = pd.DataFrame(X_scaled, columns=X_features.columns)
data_scaled_df_copy = data_scaled_df.copy()
#Pre scaled data
final_data.head()
| raceId | driverId | constructorId | grid | finalPosition | points | laps | Total_time_ms | fastestLap | fastesLapRank | fastestLapSpeed | statusId | year | round | circuitId | qualifyingPosition | total_pit_stops | average_pit_duration_ms | avg_lap_time_ms | position_change | fastestLapTime_ms | q1_ms | q2_ms | q3_ms | status_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 841 | 20 | 9 | 1 | 1 | 25.0 | 58 | 5370259 | 44 | 4 | 212.488 | 1 | 2011 | 1 | 1 | 1.0 | 2.0 | 23319.500000 | 92590.672414 | 0 | 89844 | 85296 | 84090 | 83529 | 1 |
| 1 | 841 | 1 | 1 | 2 | 2 | 18.0 | 58 | 5392556 | 41 | 8 | 211.382 | 1 | 2011 | 1 | 1 | 2.0 | 2.0 | 23213.000000 | 92975.103448 | 0 | 90314 | 85384 | 84595 | 84307 | 1 |
| 2 | 841 | 808 | 4 | 6 | 3 | 15.0 | 58 | 5400819 | 55 | 7 | 211.969 | 1 | 2011 | 1 | 1 | 6.0 | 2.0 | 25109.000000 | 93117.568966 | 3 | 90064 | 85543 | 85582 | 85247 | 1 |
| 3 | 841 | 4 | 6 | 5 | 4 | 12.0 | 58 | 5402031 | 49 | 2 | 213.336 | 1 | 2011 | 1 | 1 | 5.0 | 3.0 | 24055.000000 | 93138.465517 | 1 | 89487 | 85707 | 85242 | 84974 | 1 |
| 4 | 841 | 17 | 9 | 3 | 5 | 10.0 | 58 | 5408430 | 50 | 3 | 213.066 | 1 | 2011 | 1 | 1 | 3.0 | 3.0 | 24058.666667 | 93248.793103 | -2 | 89600 | 85900 | 84658 | 84395 | 1 |
#Scaled data
data_scaled_df.head()
| grid | finalPosition | points | Total_time_ms | fastestLap | fastesLapRank | fastestLapSpeed | qualifyingPosition | total_pit_stops | average_pit_duration_ms | avg_lap_time_ms | position_change | fastestLapTime_ms | q1_ms | q2_ms | q3_ms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.290033 | -1.235402 | 1.737068 | -0.439229 | -0.391093 | -0.513603 | 0.236956 | -1.414751 | -0.027390 | -0.269378 | -0.389816 | -0.048314 | -0.065757 | -0.250179 | -0.293561 | -0.325067 |
| 1 | -0.964747 | -0.912358 | 0.794654 | -0.420336 | -0.635854 | 0.465612 | 0.184532 | -1.046952 | -0.027390 | -0.269930 | -0.369424 | -0.048314 | -0.025557 | -0.243077 | -0.252019 | -0.262457 |
| 2 | 0.336397 | -0.589315 | 0.390762 | -0.413335 | 0.506365 | 0.220808 | 0.212356 | 0.424243 | -0.027390 | -0.260102 | -0.361867 | 0.975400 | -0.046940 | -0.230246 | -0.170826 | -0.186809 |
| 3 | 0.011111 | -0.266271 | -0.013130 | -0.412308 | 0.016842 | -1.003211 | 0.277152 | 0.056444 | 1.016529 | -0.265566 | -0.360759 | 0.292924 | -0.096292 | -0.217011 | -0.198795 | -0.208779 |
| 4 | -0.639461 | 0.056772 | -0.282391 | -0.406886 | 0.098430 | -0.758407 | 0.264354 | -0.679154 | 1.016529 | -0.265547 | -0.354907 | -0.730789 | -0.086627 | -0.201436 | -0.246836 | -0.255375 |
MATRIX COMPLETION¶
merged_matrix_comp["fastestLapTime_ms"] = merged_matrix_comp["fastestLapTime"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q1_ms"] = merged_matrix_comp["q1"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q2_ms"] = merged_matrix_comp["q2"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["q3_ms"] = merged_matrix_comp["q3"].apply(time_str_to_ms).astype("Int64")
merged_matrix_comp["Total_time_ms"] = pd.to_numeric(merged_matrix_comp["Total_time_ms"], errors="coerce").astype("Int64")
merged_matrix_comp["fastestLap"] = pd.to_numeric(merged_matrix_comp["fastestLap"], errors="coerce").astype("Int64")
merged_matrix_comp["fastesLapRank"] = pd.to_numeric(merged_matrix_comp["fastesLapRank"], errors="coerce").astype("Int64")
merged_matrix_comp["fastestLapSpeed"] = pd.to_numeric(merged_matrix_comp["fastestLapSpeed"], errors="coerce").astype("float64")
merged_matrix_comp = merged_matrix_comp.drop(columns=["fastestLapTime", "q1", "q2", "q3", "avg_lap_time_str", "time", "name", "date", "status", "driverRef" ])
df_all = merged_matrix_comp.copy().astype("float")
scaler = StandardScaler()
new_df = scaler.fit_transform(df_all)
# Impute the missing entries with SoftImpute (iterative soft-thresholded SVD)
soft_imputer = SoftImpute()
X_completed = soft_imputer.fit_transform(new_df)
# Convert the completed matrix back to a DataFrame
completed_df = pd.DataFrame(X_completed, columns=df_all.columns, index=df_all.index)
[SoftImpute] Max Singular Value of X_init = 266.705073
[SoftImpute] Iter 1: observed MAE=0.027991 rank=23
... (iterations 2-99 omitted; observed MAE decreased monotonically, rank fixed at 23) ...
[SoftImpute] Iter 100: observed MAE=0.027671 rank=23
[SoftImpute] Stopped after iteration 100 for lambda=5.334101
print("Missing values before:", np.isnan(new_df).sum())
print("Missing values after: ", np.isnan(completed_df).sum().sum())
print("Missing %:", np.isnan(new_df).sum() / new_df.size * 100)
Missing values before: 227663 Missing values after: 0 Missing %: 36.990886446981676
Before imputation, 227,663 entries were missing, about 37% of the entire feature matrix. After SoftImpute, all missing values have been filled, leaving a complete matrix with 0% missing data. The rank remained fixed at 23 across all iterations, so the algorithm settled on a low-rank approximation of the data matrix with 23 latent factors.
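For intuition, the core of SoftImpute can be sketched in a few lines of NumPy: alternately fit an SVD to the current fill-in, soft-threshold the singular values, and restore the observed entries. This is a simplified illustration, not fancyimpute's implementation; the `shrink` value and iteration count here are arbitrary.

```python
import numpy as np

def soft_impute_sketch(X, shrink=0.5, n_iter=50):
    """Fill NaNs by iterating SVD + soft-thresholding of singular values."""
    mask = np.isnan(X)
    filled = np.where(mask, 0.0, X)                 # initialize holes with zeros
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - shrink, 0.0)      # soft-threshold the spectrum
        low_rank = (U * s_shrunk) @ Vt              # low-rank reconstruction
        filled = np.where(mask, low_rank, X)        # keep observed entries fixed
    return filled

rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 8))   # exactly rank 3
X = truth.copy()
X[rng.random(X.shape) < 0.3] = np.nan                        # ~30% missing
completed = soft_impute_sketch(X)
print(np.isnan(completed).sum())  # 0
```

Observed entries are left untouched; only the NaN positions receive values from the low-rank reconstruction, which is the same contract the library version honors.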
SVD¶
np.random.seed(42)
U, S, VT = np.linalg.svd(data_scaled_df, full_matrices=False)
explained_variance = (S ** 2) / np.sum(S ** 2) # Explained variance
svd_projection = U[:, :2] @ np.diag(S[:2]) # 2D SVD Projection
#sns.set(rc={'axes.facecolor': '#fcf0dc'}, style='darkgrid')
cumulative_explained_variance = np.cumsum(explained_variance)
# Plot the cumulative explained variance against the number of components
plt.figure(figsize=(20, 10))
# Bar chart for the explained variance of each component
barplot = sns.barplot(x=list(range(1, len(cumulative_explained_variance) + 1)),
                      y=explained_variance,
                      alpha=0.8)
# Line plot for the cumulative explained variance
lineplot, = plt.plot(range(0, len(cumulative_explained_variance)), cumulative_explained_variance,
                     marker='o', linestyle='--', linewidth=2)
# Set labels and title
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Explained Variance', fontsize=14)
plt.title('Cumulative Variance vs. Number of Components', fontsize=18)
# Customize ticks and legend
plt.xticks(range(0, len(cumulative_explained_variance)))
plt.legend(handles=[barplot.patches[0], lineplot],
           labels=['Explained Variance of Each Component', 'Cumulative Explained Variance'],
           loc=(0.62, 0.1),
           frameon=True,
           framealpha=1.0,
           )
# Display the variance values for both graphs on the plots
x_offset = -0.3
y_offset = 0.01
for i, (ev_ratio, cum_ev_ratio) in enumerate(zip(explained_variance, cumulative_explained_variance)):
    plt.text(i, ev_ratio, f"{ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
    if i > 0:
        plt.text(i + x_offset, cum_ev_ratio + y_offset, f"{cum_ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
plt.grid(axis='both')
plt.show()
The plot and the cumulative explained variance values indicate how much of the total variance in the dataset is captured by each component, as well as the cumulative variance explained by the first n components.
Here, we can observe that:
The first component explains approximately 31% of the variance.
The first two components together explain about 54% of the variance.
The first six components explain approximately 88% of the variance, and so on.
We choose 6 as the optimal number of components, after which adding another component does not significantly increase the cumulative variance; this is often referred to as the "elbow point" of the curve.
From the plot, the growth in cumulative variance starts to slow after the 6th component (which captures about 88% of the total variance), and the per-component explained variance flattens out beyond that point.
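This threshold-based choice can also be made programmatically. The singular values below are made up for illustration; in the notebook, the explained-variance ratios come from the SVD cell above, and the 0.88 cutoff simply mirrors the figure rather than any fixed rule:

```python
import numpy as np

S = np.array([10.0, 8.5, 5.0, 4.0, 3.5, 3.0, 1.5, 1.0])  # illustrative singular values
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative explained variance reaches the cutoff
n_components = int(np.searchsorted(cumulative, 0.88) + 1)
print(n_components)  # 4
```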
PCA¶
# Performing PCA with 6 components (chosen from the elbow analysis above)
pca = PCA(n_components=6)
F1_data_pca = pca.fit_transform(data_scaled_df)
# Creating a new dataframe from the PCA dataframe, with columns labeled PC1, PC2, etc.
F1_data_pca = pd.DataFrame(F1_data_pca, columns=['PC'+str(i+1) for i in range(pca.n_components_)])
id_index = final_data[["raceId", "driverId"]].reset_index(drop=True)
pca_df = pd.concat([id_index, F1_data_pca], axis=1).set_index(["raceId", "driverId"])
pca_df = pca_df.reset_index()
pca_df.head()
| raceId | driverId | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 841 | 20 | -1.454654 | -2.467678 | -0.395891 | -0.405817 | 0.197826 | 0.369316 |
| 1 | 841 | 1 | -0.852697 | -1.307286 | -0.500208 | -0.520739 | 0.199261 | 0.652279 |
| 2 | 841 | 808 | -0.563475 | 0.022435 | -0.643267 | 1.149230 | -0.113728 | -0.262184 |
| 3 | 841 | 4 | -0.646255 | -0.312125 | -0.114757 | 0.235873 | 0.206241 | -1.254385 |
| 4 | 841 | 17 | -0.756625 | -0.405742 | 0.096841 | -1.191356 | 0.212870 | -1.058389 |
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Reduce to 2 dimensions for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(data_scaled_df)
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
# --- Plot 1: PCA projection (PC1 vs PC2)
axs[0].scatter(F1_data_pca['PC1'], F1_data_pca['PC2'], alpha=0.6, color='cornflowerblue')
axs[0].set_title("PCA Projection: PC1 vs PC2")
axs[0].set_xlabel("Principal Component 1")
axs[0].set_ylabel("Principal Component 2")
axs[0].grid(True)
# --- Plot 2: Original features (avg_lap_time_ms vs q3_ms)
axs[1].scatter(data_scaled_df_copy['avg_lap_time_ms'], data_scaled_df_copy['q3_ms'], alpha=0.6, color='darkorange')
axs[1].set_title("Original Feature Space")
axs[1].set_xlabel("Average Lap Time (ms) [standardized]")
axs[1].set_ylabel("Average q3 Time (ms) [standardized]")
axs[1].grid(True)
plt.tight_layout()
plt.show()
There is no clear separation in the data when we visualize the first two PCA components. Next, we perform K-means on this data to visualize the clusters.
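A minimal sketch of that next step, K-means on a 2-D PCA projection with a silhouette check, on synthetic stand-in data (k=3 is illustrative here; the notebook's own selection relies on the elbow and silhouette tools imported earlier):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 16))                   # stand-in for the scaled feature matrix

X2 = PCA(n_components=2).fit_transform(X)        # project to two components
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X2)

# Silhouette lies in [-1, 1]; higher means better-separated clusters
print(round(silhouette_score(X2, labels), 3))
```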
def highlight_top3(column):
    top3 = column.abs().nlargest(3).index
    return ['background-color: #ffeacc' if i in top3 else '' for i in column.index]
# Create the PCA component DataFrame and apply the highlighting function
pc_df = pd.DataFrame(pca.components_.T, columns=['PC{}'.format(i+1) for i in range(pca.n_components_)],
index=data_scaled_df.columns)
pc_df.style.apply(highlight_top3, axis=0)
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| grid | 0.152734 | 0.378691 | -0.078314 | 0.447781 | 0.003427 | -0.123384 |
| finalPosition | 0.167506 | 0.453206 | 0.019497 | -0.263958 | 0.028400 | -0.081915 |
| points | -0.164398 | -0.458953 | -0.004199 | 0.197710 | -0.021615 | 0.090173 |
| Total_time_ms | 0.052056 | 0.021915 | 0.597753 | 0.102366 | -0.173509 | 0.187021 |
| fastestLap | -0.241297 | 0.078886 | 0.111464 | 0.032653 | -0.282897 | -0.381710 |
| fastesLapRank | 0.155376 | 0.357945 | -0.063381 | -0.091202 | -0.045254 | 0.370206 |
| fastestLapSpeed | -0.033061 | -0.016724 | -0.233901 | -0.012725 | 0.792781 | -0.209529 |
| qualifyingPosition | 0.160590 | 0.416327 | -0.053051 | 0.290780 | -0.010428 | -0.067910 |
| total_pit_stops | 0.004263 | 0.017461 | 0.376780 | -0.077143 | 0.031884 | -0.715291 |
| average_pit_duration_ms | 0.027882 | 0.019915 | 0.466359 | 0.109448 | 0.462281 | 0.228829 |
| avg_lap_time_ms | 0.311299 | -0.094740 | 0.409112 | 0.076097 | 0.168784 | 0.060818 |
| position_change | -0.016716 | -0.081469 | -0.102749 | 0.748564 | -0.026403 | -0.042905 |
| fastestLapTime_ms | 0.424442 | -0.162026 | -0.046598 | -0.014508 | -0.069647 | 0.007642 |
| q1_ms | 0.419856 | -0.177818 | -0.080431 | -0.030560 | -0.059503 | -0.097045 |
| q2_ms | 0.422749 | -0.175166 | -0.086161 | -0.028154 | -0.056222 | -0.105239 |
| q3_ms | 0.420119 | -0.165089 | -0.093527 | -0.012085 | -0.049619 | -0.116594 |
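The highlighted cells above can also be extracted programmatically, which is handy for writing the interpretation. A sketch on a toy loadings frame shaped like `pc_df` (features × components; the feature names here are just illustrative stand-ins):

```python
import numpy as np
import pandas as pd

# Toy loadings frame standing in for pc_df (rows: features, columns: components)
loadings = pd.DataFrame(
    np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.5], [0.1, 0.3]]),
    index=["q1_ms", "grid", "points", "laps"],
    columns=["PC1", "PC2"],
)

# Top-3 features by absolute loading for each component
top3 = {pc: loadings[pc].abs().nlargest(3).index.tolist() for pc in loadings.columns}
print(top3)
```

Applied to the real `pc_df`, this reproduces exactly the cells the styling function shades.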
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
scaling_factor = 9
# Intersect with valid raceIds
selected_drivers = ['hamilton', 'vettel', 'max_verstappen', 'alonso', 'ricciardo']
#selected_races = [841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 861, 862, 863, 864, 865, 866, 867, 868, 869] # Example raceIds
#selected_races = list(range(1000, 1010))
selected_races = list(range(880, 900))
# Filter the merged dataframe
subset = merged[(merged['driverRef'].isin(selected_drivers)) & (merged['raceId'].isin(selected_races))].copy()
# Get the aligned scaled data
subset_scaled = data_scaled_df_copy.loc[subset.index]
subset_pca = pd.DataFrame(pca_2d.transform(subset_scaled), columns=['PC1', 'PC2'], index=subset.index)
colors = plt.cm.tab10  # ListedColormap lookup; avoids the deprecated plt.cm.get_cmap
# Plot setup
fig, ax = plt.subplots(figsize=(12, 8))
# Plot PCA scores for each driver
for i, driver in enumerate(selected_drivers):
    idx = subset[subset['driverRef'] == driver].index
    ax.scatter(subset_pca.loc[idx, 'PC1'], subset_pca.loc[idx, 'PC2'],
               color=colors(i), label=driver, alpha=0.8, s=60)
line_length = 10   # Length of each loading vector line
label_offset = 10  # Increase this to spread out label text more
for i, (x_comp, y_comp) in enumerate(pca_2d.components_.T):
    # Line (arrow without head)
    ax.plot([0, x_comp * line_length], [0, y_comp * line_length], color='black', linewidth=1.5)
    # Text with increased spacing, rotated to follow its loading direction
    angle_rad = np.arctan2(y_comp, x_comp)
    angle_deg = np.degrees(angle_rad)
    # Flip upside-down labels to keep them upright
    if angle_deg > 90 or angle_deg < -90:
        angle_deg += 180
    ax.text(x_comp * label_offset,
            y_comp * label_offset,
            data_scaled_df_copy.columns[i],
            fontsize=11, color='darkred',
            rotation=angle_deg, rotation_mode='anchor',
            ha='center', va='center')
# Annotate up to 30 random driver-race combos
#np.random.seed(42)
label_idx = np.random.choice(subset.index, size=min(30, len(subset)), replace=False)
for i in label_idx:
    label = subset.loc[i, 'driverRef'] + "_race" + str(subset.loc[i, 'raceId'])
    ax.text(subset_pca.loc[i, 'PC1'], subset_pca.loc[i, 'PC2'], label, fontsize=9)
# Aesthetics
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("PCA Projection with driver/race combo")
ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.grid(True)
ax.legend(title="Driver")
plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
scaling_factor = 9
line_length = 10
label_offset = 10
# Define your selected drivers and races
selected_drivers = ['vettel']
#selected_races = list(range(880, 900))
selected_races = [883, 888, 893, 898, 899]
# Filter the merged dataframe for selected drivers and races
subset_all = merged[(merged['driverRef'].isin(selected_drivers)) & (merged['raceId'].isin(selected_races))].copy()
# Filter only winning entries (final position == 1)
subset = subset_all[subset_all['finalPosition'] == 1].copy()
# Get the aligned scaled data
subset_scaled = data_scaled_df_copy.loc[subset.index]
subset_pca = pd.DataFrame(pca_2d.transform(subset_scaled), columns=['PC1', 'PC2'], index=subset.index)
colors = plt.cm.tab10  # ListedColormap lookup; avoids the deprecated plt.cm.get_cmap
# Plot setup
fig, ax = plt.subplots(figsize=(12, 8))
# Plot PCA scores for each winner
for i, driver in enumerate(selected_drivers):
    idx = subset[subset['driverRef'] == driver].index
    ax.scatter(subset_pca.loc[idx, 'PC1'], subset_pca.loc[idx, 'PC2'],
               color=colors(i), label=driver, alpha=0.8, s=60)
# Plot PCA directions
for i, (x_comp, y_comp) in enumerate(pca_2d.components_.T):
    ax.plot([0, x_comp * line_length], [0, y_comp * line_length], color='black', linewidth=1.5)
    angle_rad = np.arctan2(y_comp, x_comp)
    angle_deg = np.degrees(angle_rad)
    if angle_deg > 90 or angle_deg < -90:
        angle_deg += 180
    ax.text(x_comp * label_offset,
            y_comp * label_offset,
            data_scaled_df_copy.columns[i],
            fontsize=11, color='darkred',
            rotation=angle_deg, rotation_mode='anchor',
            ha='center', va='center')
# Label the winning driver-race combos
for i in subset.index:
    label = subset.loc[i, 'driverRef'] + "_race" + str(subset.loc[i, 'raceId'])
    ax.text(subset_pca.loc[i, 'PC1'], subset_pca.loc[i, 'PC2'], label, fontsize=9)
# Aesthetics
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("PCA Projection (Only Race Winners Shown)")
ax.axhline(0, color='gray', lw=0.5)
ax.axvline(0, color='gray', lw=0.5)
ax.grid(True)
ax.legend(title="Driver")
plt.tight_layout()
plt.show()
K Means¶
Determining the optimal number of clusters¶
ks = range(2, 15)
wss = []
silhouette_scores = []
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=20, random_state=42)
    labels = kmeans.fit_predict(F1_data_pca)
    wss.append(kmeans.inertia_)
    sil_score = silhouette_score(F1_data_pca, labels)
    silhouette_scores.append(sil_score)
# Plot
fig, ax1 = plt.subplots(figsize=(10, 5))
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Within-Cluster Sum of Squares')
ax1.plot(ks, wss, marker='o', label='WSS')
ax1.tick_params(axis='y')
ax2 = ax1.twinx()
ax2.set_ylabel('Silhouette Score')
ax2.plot(ks, silhouette_scores, marker='s', linestyle='--', label='Silhouette')
ax2.tick_params(axis='y')
plt.title('Clustering Performance vs Number of Clusters (k)')
fig.tight_layout()
plt.grid(True)
plt.show()
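Rather than eyeballing the elbow/silhouette plot, the candidate k with the highest silhouette can be picked programmatically. A self-contained sketch on synthetic blobs where the true number of clusters is known (in the notebook, `F1_data_pca` and the `ks`/`silhouette_scores` lists above would replace the toy data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs, so the "right" k is known in advance
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=42)

ks = list(range(2, 8))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Pick the k with the highest silhouette score
best_k = ks[int(np.argmax(scores))]
print(best_k)  # -> 4
```

On real data the silhouette peak is rarely this sharp, so it is best read together with the WSS elbow rather than trusted alone.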
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=9, max_iter=100, random_state=0)
kmeans.fit(F1_data_pca)
# Optional: get cluster frequencies
cluster_frequencies = Counter(kmeans.labels_)
# Optional: sort clusters by size (optional relabeling)
sorted_clusters = [c for c, _ in cluster_frequencies.most_common()]
relabel = {old: new for new, old in enumerate(sorted_clusters)}
new_labels = np.array([relabel[l] for l in kmeans.labels_])
# Add cluster labels
F1_data_pca["cluster"] = new_labels
data_scaled_df["cluster"] = new_labels
data_scaled_df.head()
| | grid | finalPosition | points | Total_time_ms | fastestLap | fastesLapRank | fastestLapSpeed | qualifyingPosition | total_pit_stops | average_pit_duration_ms | avg_lap_time_ms | position_change | fastestLapTime_ms | q1_ms | q2_ms | q3_ms | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.290033 | -1.235402 | 1.737068 | -0.439229 | -0.391093 | -0.513603 | 0.236956 | -1.414751 | -0.027390 | -0.269378 | -0.389816 | -0.048314 | -0.065757 | -0.250179 | -0.293561 | -0.325067 | 0 |
| 1 | -0.964747 | -0.912358 | 0.794654 | -0.420336 | -0.635854 | 0.465612 | 0.184532 | -1.046952 | -0.027390 | -0.269930 | -0.369424 | -0.048314 | -0.025557 | -0.243077 | -0.252019 | -0.262457 | 0 |
| 2 | 0.336397 | -0.589315 | 0.390762 | -0.413335 | 0.506365 | 0.220808 | 0.212356 | 0.424243 | -0.027390 | -0.260102 | -0.361867 | 0.975400 | -0.046940 | -0.230246 | -0.170826 | -0.186809 | 0 |
| 3 | 0.011111 | -0.266271 | -0.013130 | -0.412308 | 0.016842 | -1.003211 | 0.277152 | 0.056444 | 1.016529 | -0.265566 | -0.360759 | 0.292924 | -0.096292 | -0.217011 | -0.198795 | -0.208779 | 0 |
| 4 | -0.639461 | 0.056772 | -0.282391 | -0.406886 | 0.098430 | -0.758407 | 0.264354 | -0.679154 | 1.016529 | -0.265547 | -0.354907 | -0.730789 | -0.086627 | -0.201436 | -0.246836 | -0.255375 | 0 |
colors = ['#e8000b', '#1ac938', '#023eff', '#000000']
cluster_0 = F1_data_pca[F1_data_pca['cluster'] == 0]
cluster_1 = F1_data_pca[F1_data_pca['cluster'] == 1]
cluster_2 = F1_data_pca[F1_data_pca['cluster'] == 2]
cluster_3 = F1_data_pca[F1_data_pca['cluster'] == 3]
# Create a 3D scatter plot
fig = go.Figure()
# Add data points for each cluster separately and specify the color
fig.add_trace(go.Scatter3d(x=cluster_0['PC1'], y=cluster_0['PC2'], z=cluster_0['PC3'],
mode='markers', marker=dict(color=colors[0], size=5, opacity=0.4), name='Cluster 0'))
fig.add_trace(go.Scatter3d(x=cluster_1['PC1'], y=cluster_1['PC2'], z=cluster_1['PC3'],
mode='markers', marker=dict(color=colors[1], size=5, opacity=0.4), name='Cluster 1'))
fig.add_trace(go.Scatter3d(x=cluster_2['PC1'], y=cluster_2['PC2'], z=cluster_2['PC3'],
mode='markers', marker=dict(color=colors[2], size=5, opacity=0.4), name='Cluster 2'))
fig.add_trace(go.Scatter3d(x=cluster_3['PC1'], y=cluster_3['PC2'], z=cluster_3['PC3'],
mode='markers', marker=dict(color=colors[3], size=5, opacity=0.4), name='Cluster 3'))
# Set the title and layout details
fig.update_layout(
title=dict(text='3D Visualization of Clusters in PCA Space', x=0.5),
scene=dict(
xaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC1'),
yaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC2'),
zaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC3'),
),
width=900,
height=800
)
# Show the plot
fig.show()
num_observations = len(F1_data_pca)
# Separate the features and the cluster labels
X = F1_data_pca.drop('cluster', axis=1)
clusters = F1_data_pca['cluster']
# Compute the metrics
sil_score = silhouette_score(X, clusters)
# Create a table to display the metrics and the number of observations
table_data = [
["Number of Observations", num_observations],
["Silhouette Score", sil_score]
]
# Print the table
print(tabulate(table_data, headers=["Metric", "Value"], tablefmt='pretty'))
colors = ['#e8000b', '#1ac938', '#023eff', '#000000']
# The features were already standardized upstream, so just work on a copy
# (the cluster column comes along with it)
df_standardized = data_scaled_df.copy()
# Calculate the centroids of each cluster
cluster_centroids = df_standardized.groupby('cluster').mean()
# Function to create a radar chart
def create_radar_chart(ax, angles, data, color, cluster):
    # Plot the data and fill the area
    ax.fill(angles, data, color=color, alpha=0.4)
    ax.plot(angles, data, color=color, linewidth=2, linestyle='solid')
    # Add a title
    ax.set_title(f'Cluster {cluster}', size=20, color=color, y=1.1)
# Set data
labels=np.array(cluster_centroids.columns)
num_vars = len(labels)
# Compute angle of each axis
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
# The plot is circular, so we need to "complete the loop" and append the start to the end
labels = np.concatenate((labels, [labels[0]]))
angles += angles[:1]
# Initialize the figure
fig, ax = plt.subplots(figsize=(20, 10), subplot_kw=dict(polar=True), nrows=1, ncols=4)
# Create radar chart for each cluster
for i, color in enumerate(colors):
    data = cluster_centroids.loc[i].tolist()
    data += data[:1]  # Complete the loop
    create_radar_chart(ax[i], angles, data, color, i)
# Set the axis ticks and feature labels on every subplot
for axis in ax:
    axis.set_xticks(angles[:-1])
    axis.set_xticklabels(labels[:-1])
# Add a grid
ax[0].grid(color='grey', linewidth=0.5)
# Display the plot
plt.tight_layout()
plt.show()
Interpretation from radar chart
Cluster 0 (Red)
High on: fastestLap, points, positionOrder (final position)
Low on: position_change (negative) and qualifyingPosition
Interpretation: These drivers tend to finish at the top with fast lap times and strong race-day performance, even if they lose positions relative to their start. Possibly consistent, fast finishers.
Cluster 1 (Green)
High on: q1_ms, q2_ms, q3_ms (longer times), position_change
Low on: race outcomes like points, positionOrder, grid
Interpretation: Drivers in this group qualify poorly (slow qualifying laps) and often start at the back, but show strong racecraft, gaining positions during the race. Possibly skilled midfielders or rookies.
Cluster 2 (Blue)
High on: fastestLapSpeed, qualifyingPosition, positionOrder
Low on: points, pit_stops, avg_lap_time_ms
Interpretation: Drivers in this cluster are excellent qualifiers with strong lap speeds, but may not convert those into high points. Possibly inconsistent performers or drivers in weaker cars.
Cluster 3 (Black)
High on: average_pit_duration_ms, total_pit_stops, Total_time_ms
Low on: everything else
Interpretation: Likely poor performers or drivers impacted by long pit stops, technical issues, or penalties. These could be backmarkers or drivers who rarely finish well.
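One way to sanity-check these interpretations is to cross-tabulate cluster labels against driver names and see whether known front-runners and backmarkers land where expected. A sketch on a toy frame (in the notebook, `merged['driverRef']` and the k-means labels would replace the toy columns):

```python
import pandas as pd

# Toy stand-in: one row per driver-race entry with its assigned cluster
df = pd.DataFrame({
    "driverRef": ["hamilton", "hamilton", "vettel", "vettel", "alonso", "alonso"],
    "cluster":   [0, 0, 0, 1, 1, 3],
})

# Rows: drivers; columns: clusters; values: number of race entries in each cluster
table = pd.crosstab(df["driverRef"], df["cluster"])
print(table)
```

If, say, championship-winning drivers concentrate in the "fast finisher" cluster, that lends support to the radar-chart reading above.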
Hierarchical Clustering¶
linkage_methods = ['single', 'complete', 'average', 'centroid']
cluster_numbers = [2, 3, 4, 5, 6]
# Loop over each linkage method
for method in linkage_methods:
    Z = linkage(X_scaled, method=method, metric='euclidean')
    # Plot the dendrogram
    plt.figure(figsize=(15, 8))
    dendrogram(
        Z, leaf_rotation=90, leaf_font_size=6, color_threshold=None, no_labels=True
    )
    plt.title(f"{method.capitalize()} Linkage Dendrogram")
    plt.xlabel("Cluster ID")
    plt.ylabel("Distance")
    plt.tight_layout()
    plt.show()
    # Try out different numbers of flat clusters
    print(f"\n {method.capitalize()} linkage:")
    for k in cluster_numbers:
        labels = fcluster(Z, t=k, criterion='maxclust')
        # bincount returns the counts for labels 1..k
        sizes = np.bincount(labels)[1:]
        print(f"k = {k:>2} -> cluster sizes: {sizes.tolist()}")
Key Observations:
--- Single linkage:
- Extremely imbalanced clusters at every k
- One massive cluster plus several very small clusters
--- Complete linkage:
- Produces more balanced, mid-sized clusters at moderate values of k (4, 5, 6)
- Still isolates some true outliers, but captures two or three main groupings among the bulk of the data
--- Average and Centroid linkage:
- Both give outcomes similar to single linkage
- One dominating cluster plus other very small clusters / a few outliers
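These qualitative comparisons can be backed with a number: the cophenetic correlation measures how faithfully each dendrogram preserves the original pairwise distances (closer to 1 is better). A self-contained sketch on random data (in the notebook, `X_scaled` would replace the stand-in matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# Cophenetic correlation for each linkage method
d = pdist(X)
results = {}
for method in ["single", "complete", "average", "centroid"]:
    c, _ = cophenet(linkage(X, method=method), d)
    results[method] = round(c, 3)
print(results)
```

A high cophenetic correlation does not by itself mean balanced clusters, so it should be read alongside the cluster-size breakdowns printed above.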
Scaling¶
drop_cols = [
"raceId","constructorId","circuitId", "statusId","round","year","laps","hc_cluster"
]
# Here, numeric features only considered, then dropping unwanted columns
numeric_cols = final_data.select_dtypes(include="number").columns
features = final_data[numeric_cols].drop(columns=drop_cols, errors="ignore")
# Selected random sub-sample of 100 rows for dendrogram
n_samples = 100
sub_idx = np.random.choice(features.index, size=n_samples, replace=False)
X_sub = features.loc[sub_idx]
driver_ids = final_data.loc[sub_idx, 'driverId'].astype(str).values
# Now, lets scale sub-sampled data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sub)
scaled_df = pd.DataFrame(X_scaled, columns=X_sub.columns, index=X_sub.index)
# Linkage computation and dendrograms plots
linkage_methods = ['single','complete','average','centroid']
for method in linkage_methods:
    Z = linkage(X_scaled, method=method, metric='euclidean')
    plt.figure(figsize=(16, 6))
    dendrogram(
        Z,
        labels=driver_ids,
        leaf_rotation=90,
        leaf_font_size=8,
    )
    plt.title(f"{method.capitalize()} Linkage (truncated) with driver IDs")
    plt.xlabel("Driver ID")
    plt.ylabel("Distance")
    plt.tight_layout()
    plt.show()
    # Extract flat clusters on the full linkage of the sub-sample
    print(f"\n{method.capitalize()} linkage flat clusters (n={n_samples}):")
    for k in [2, 3, 4, 5, 6]:
        labels = fcluster(Z, t=k, criterion='maxclust')
        sizes = np.bincount(labels)[1:]
        print(f"k={k:<2} -> sizes = {sizes.tolist()}")
We first drew a random sub-sample of 100 rows, kept only the numeric features (dropping ID and other unwanted columns), and standardized them.
Effect of feature scaling:
--- Single:
- Normalizing the features slightly changes the linkage distances, but single linkage still forms one giant cluster containing almost all points, with individual outliers split off.
--- Complete:
- More balanced groupings: scaling yields two or three mid-sized clusters instead of a single massive one.
- Outliers continue to be isolated as tiny clusters, but they are now spread more evenly across clusters.
- Complete appears to be the best linkage type for this dataset, as it divides the scaled data into clusters most effectively.
--- Average:
- Still produces one dominant cluster holding most points, plus a smaller secondary cluster and a few singleton outliers.
--- Centroid:
- Scaling makes centroid calculations more stable, but the big cluster and the outliers persist.
--- Differences observed after scaling:
- Scaling balances the influence of each feature, so no single variable with a large range dominates the clustering.
- Complete linkage benefits most, producing noticeably more even cluster sizes once features are standardized.
- Single, average, and centroid linkages remain largely unchanged in pattern.
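The first point above, that an unscaled feature with a large range dominates the distance computation, can be demonstrated directly. A sketch with two made-up features on wildly different scales (lap time in milliseconds vs. points, mirroring the columns in this dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: lap time (ms) and points
X = np.array([[90_000.0, 25.0],
              [91_000.0,  0.0],
              [90_500.0, 18.0]])

# Unscaled, pairwise distances are dominated by the millisecond column
raw = pdist(X)

# After standardization, both features contribute comparably
scaled = pdist(StandardScaler().fit_transform(X))
print(np.round(raw, 1), np.round(scaled, 3))
```

In the raw distances the 25-point gap is invisible next to a 1000 ms gap, which is exactly why every linkage method behaves differently once the features are standardized.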